Accelerator Architecture
for Sparse High-Order
Tensor Contraction

Gabriel Kulp, Andrew Ensinger
February 28, 2023

Introduction

FLAASH: Flexible Accelerator Architecture
for Sparse High-Order Tensor Contraction

Gabriel Kulp, Andrew Ensinger, Lizhong Chen

Paper submitted to ICML this month

Roadmap

  1. Introduction (you are here)
  2. Background and Motivation
    • What is sparse tensor contraction?
    • What are accelerators?
  3. Methods and Results
    • What did we do?
    • Did it work?
  4. Future Work
    • Where can we collaborate?

Summary

  • We designed an architecture in simulation
  • It contracts sparse high-order tensors
  • We also implemented it in Verilog
  • Benchmarks show a ~25x speedup over software baselines

Background

Things I want to teach you

  • What kind of math is this?
  • Why would anyone want that?
  • What have others done before us?

What is a tensor?

  • Multi-dimensional array
  • Represents a multilinear transform
  • Many applications

Vocab

  • Mode: single dimension/direction
  • Order: how many modes
  • Fiber: row, column, etc.
  • Coordinate: one index per mode (which row, which column, …)
  • Volume: product of mode lengths
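The vocabulary above can be illustrated with a small dense NumPy tensor (a hypothetical example for intuition; the accelerator itself operates on sparse data):

```python
import numpy as np

T = np.arange(24).reshape(2, 3, 4)  # an order-3 tensor

order = T.ndim                 # order: number of modes -> 3
volume = T.size                # volume: product of mode lengths -> 2*3*4 = 24
fiber = T[0, 1, :]             # a mode-2 fiber: fix every coordinate but one
coord_value = T[1, 2, 3]       # the entry at coordinate (1, 2, 3)

print(order, volume, fiber, coord_value)
```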

High-Order? Sparse?

“High-order” ≈ more than 3 modes

“Sparse” ≈ most entries are zero
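A sparse high-order tensor can be sketched in coordinate (COO) form: store only the nonzero entries as (coordinate, value) pairs. The names and numbers here are illustrative, not from the paper:

```python
# An order-4 sparse tensor: coordinate tuple -> nonzero value.
sparse_tensor = {
    (0, 1, 2, 3): 5.0,
    (1, 0, 0, 2): -1.5,
    (1, 3, 2, 0): 2.25,
}

shape = (2, 4, 3, 4)                    # mode lengths
volume = 2 * 4 * 3 * 4                  # 96 possible entries
density = len(sparse_tensor) / volume   # only 3 stored -> ~3% dense
print(f"{density:.3f}")
```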

Tensor contraction

  • Generalizes matrix multiplication
  • Choose one mode from each operand, then take dot products along it
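The "choose one mode from each operand, then dot" idea can be shown densely with NumPy's `tensordot` (illustrative only; the accelerator works on sparse operands):

```python
import numpy as np

A = np.random.rand(2, 5, 3)   # order-3 tensor
B = np.random.rand(5, 4)      # order-2 tensor (a matrix)

# Contract mode 1 of A (length 5) against mode 0 of B:
C = np.tensordot(A, B, axes=(1, 0))   # result has shape (2, 3, 4)

# Ordinary matrix multiplication is the special case of two order-2 tensors:
M, N = np.random.rand(3, 5), np.random.rand(5, 7)
assert np.allclose(np.tensordot(M, N, axes=(1, 0)), M @ N)
print(C.shape)
```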

What is an accelerator?

  • CPUs are general
  • GPUs are more specific
  • GPUs are accelerators
  • Accelerators do something specific
  1. CPU gives the problem to the accelerator
  2. Accelerator solves it
  3. Results sent back to CPU

Challenges

  1. Control flow
  2. Memory access
  3. Workload balance

Methods

Overview

Four parts

  1. Tensor Memory
  2. Job Generation and Dispatch
  3. Processing Elements (“workers”)
  4. Input/Output Interface

Tensor Memory

  • Convert coordinates to pointers
  • Manage data structures
    • Bookkeeping
    • Dynamic maintenance
  • Manage caching
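A minimal sketch of "convert coordinates to pointers": map a coordinate tuple to an offset into backing storage for the nonzeros. This is a toy Python dict, not the paper's actual data structure, but it shows the bookkeeping and dynamic-maintenance roles:

```python
class TensorMemory:
    """Toy coordinate-to-pointer map for a sparse tensor's nonzeros."""

    def __init__(self):
        self.values = []            # backing storage for nonzero values
        self.index = {}             # coordinate tuple -> offset ("pointer")

    def write(self, coord, value):  # dynamic maintenance: insert or update
        if coord in self.index:
            self.values[self.index[coord]] = value
        else:
            self.index[coord] = len(self.values)
            self.values.append(value)

    def read(self, coord):          # absent coordinate -> implicit zero
        ptr = self.index.get(coord)
        return 0.0 if ptr is None else self.values[ptr]

mem = TensorMemory()
mem.write((1, 0, 2), 3.5)
print(mem.read((1, 0, 2)), mem.read((0, 0, 0)))   # 3.5 0.0
```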

Job Generation and Dispatch

  • List all dot products (on-demand)
  • Queue them up
  • Send to workers
  • Wait until queue is empty and workers are done
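Each output entry of the contraction is one independent sparse dot product, so jobs can be listed on demand and queued. A sketch in Python (the accelerator does this in hardware; the function and shapes here are illustrative):

```python
from itertools import product
from queue import Queue

def generate_jobs(left_free_shape, right_free_shape):
    """Yield one job per output coordinate: (left free coords, right free coords)."""
    for lc in product(*map(range, left_free_shape)):
        for rc in product(*map(range, right_free_shape)):
            yield (lc, rc)

jobs = Queue()
for job in generate_jobs((2, 3), (4,)):   # 2*3*4 = 24 dot products
    jobs.put(job)
print(jobs.qsize())   # 24
```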

Sparse Dot Product Engines

Five parts:

  1. Left fiber loader
  2. Right fiber loader
  3. Intersection and MAC
  4. Storage queue
  5. Job queue

Worker function

  • Fetch fibers from operands
  • “Zip” together the fibers
  • Seek “intersections” where index matches
  • Multiply and accumulate
  • Send result to memory when done
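The worker's intersect-and-MAC loop can be sketched as a two-pointer walk: two sorted sparse fibers, stored as (index, value) lists, are "zipped" together, and the multiply-accumulate fires only where the indices intersect (a software sketch of the hardware behavior, not the paper's implementation):

```python
def sparse_dot(left, right):
    """Dot product of two sorted sparse fibers given as (index, value) lists."""
    acc = 0.0
    i = j = 0
    while i < len(left) and j < len(right):
        li, lv = left[i]
        rj, rv = right[j]
        if li == rj:          # intersection: indices match
            acc += lv * rv    # multiply-accumulate
            i += 1
            j += 1
        elif li < rj:         # advance whichever pointer lags behind
            i += 1
        else:
            j += 1
    return acc

# Nonzeros at indices {0, 3, 7} and {3, 5, 7}: only 3 and 7 intersect.
print(sparse_dot([(0, 1.0), (3, 2.0), (7, 4.0)],
                 [(3, 10.0), (5, 6.0), (7, 0.5)]))   # 2*10 + 4*0.5 = 22.0
```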

Input/Output Interface

  • Handle allocation and transfer of tensors
  • Provide connection to physical memory
    • Queuing and batching for DRAM
    • Or DMA to the host

Results

How did I get results?

  • Wrote the accelerator in Verilog
  • Simulated performance with Xilinx Vivado
  • Compared simulated performance to software
  • Baselines are PyTorch and TensorFlow

How did it do?

  • Linear time complexity as expected
  • Substantial (~25x) speedup over existing tools
  • Reasonable clock speed and area

Conclusion

Future work

  • Optimize jobs for cache residency
  • Use a hash tree instead of CSF
  • Integrate with NumPy and/or PyTorch

Thank You!

Questions?
